Non-linear Mapping for Improved Identification of 1300+ Languages
Author
Abstract
Non-linear mappings of the form $P(\text{ngram})^{\gamma}$ and $\frac{\log(1+\tau P(\text{ngram}))}{\log(1+\tau)}$ are applied to the n-gram probabilities in five trainable open-source language identifiers. The first mapping reduces classification errors by 4.0% to 83.9% over a test set of more than one million 65-character strings in 1366 languages, and by 2.6% to 76.7% over a subset of 781 languages. The second mapping improves four of the five identifiers by 10.6% to 83.8% on the larger corpus and 14.4% to 76.7% on the smaller corpus. The subset corpus and the modified programs are made freely available for download at http://www.cs.cmu.edu/~ralf/langid.html.
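The two mappings named in the abstract can be sketched in a few lines of Python. This is only an illustration of the formulas: the function names `power_map` and `log_map` and the sample values of γ and τ are assumptions made here, not taken from the paper or from the released programs.

```python
import math


def power_map(p: float, gamma: float) -> float:
    """Power mapping: P(ngram) ** gamma, for a probability p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    return p ** gamma


def log_map(p: float, tau: float) -> float:
    """Logarithmic mapping: log(1 + tau * P(ngram)) / log(1 + tau)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    # log1p(x) computes log(1 + x) accurately for small x.
    return math.log1p(tau * p) / math.log1p(tau)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for p in (0.001, 0.01, 0.25, 1.0):
        print(p, power_map(p, gamma=0.5), log_map(p, tau=10.0))
```

Both functions send 0 to 0 and 1 to 1. With γ below 1, or with any positive τ, probabilities strictly between 0 and 1 are raised relative to their original value.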
Similar resources
The Effect of Concept Mapping on Iranian EFL Learners’ Vocabulary Learning and Strategy Use
This study aimed to investigate the effects of concept mapping on the extent to which Iranian EFL learners retain new vocabulary and on their degree of awareness of the vocabulary learning strategies they tended to use. To this end, a total of 40 Iranian EFL students were asked to participate in this study. They were randomly assigned to two equal groups, namely experimental and control. The part...
A comparative study of quantitative mapping methods for bias correction of ERA5 reanalysis precipitation data
This study evaluates the ability of different quantitative mapping (QM) methods as a bias correction technique for ERA5 reanalysis precipitation data. Climate type and geographical location can affect the performance of the bias correction method due to differences in precipitation characteristics. For this purpose, ERA5 reanalysis precipitation data for the years 1989-2019 for 10 selected syno...
Collocational Processing in Two Languages: A psycholinguistic comparison of monolinguals and bilinguals
With the renewed interest in the field of second language learning for the knowledge of collocating words, research findings in favour of holistic processing of formulaic language could support the idea that these language units facilitate efficient language processing. This study investigated the difference between processing of a first language (L1) and a second language (L2) of congruent col...
A Comparison of Spectral Methods for Spoken Language Identification
Automatic spoken language identification is the task of determining a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. A statistical model of these features is then obtained for each language. The ...
Identification of Multiple Input-multiple Output Non-linear System Cement Rotary Kiln using Stochastic Gradient-based Rough-neural Network
Because of the existing interactions among the variables of a multiple input-multiple output (MIMO) nonlinear system, its identification is a difficult task, particularly in the presence of uncertainties. Cement rotary kiln (CRK) is a MIMO nonlinear system in the cement factory with a complicated mechanism and uncertain disturbances. The identification of CRK is very important for different pur...